| Feeds | Fetches | Targets | #(Devices) | Count |
|---|---|---|---|---|
| Tensor | Count | DType | Shape | Value | Health Pill |
|---|---|---|---|---|---|
Alerts are sorted from top to bottom by increasing timestamp.
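The chronological ordering described above amounts to a simple sort on the alert timestamp. A minimal sketch, where the record layout and the `timestamp`/`event_count` field names are assumptions for illustration:

```python
# Sort debugger alerts chronologically (oldest first), as in the alert table.
# The alert structure and field names here are illustrative assumptions.
alerts = [
    {"timestamp": 1622000003.5, "tensor": "dense/kernel", "event_count": 2},
    {"timestamp": 1622000001.0, "tensor": "conv1/bias", "event_count": 5},
    {"timestamp": 1622000002.2, "tensor": "logits", "event_count": 1},
]
alerts.sort(key=lambda a: a["timestamp"])

# The "First Offense" column corresponds to the earliest alert for a tensor.
first_offense = alerts[0]["tensor"]
```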
| First Offense | Tensor (Device) | Event Counts |
|---|---|---|
Graph legend (* = expandable):

| Symbol | Meaning |
|---|---|
| Namespace* | High-level node representing a group of operations; double-click to expand |
| OpNode | Node representing an individual operation |
| Unconnected series* | Sequence of numbered nodes that are not connected to each other |
| Connected series* | Sequence of numbered nodes that are connected to each other |
| Constant | A constant value |
| Summary | A summary node |
| Dataflow edge | Edge showing the flow of data between operations |
| Control dependency edge | Edge showing a control dependency between operations |
| Reference edge | Edge showing that the outgoing operation can mutate the incoming tensor |
If you'd like to share your visualization with the world, follow these simple steps. See this tutorial for more.
Host the tensor, metadata, sprite image, and bookmark TSV files publicly on the web.
One option is a GitHub gist. If you choose this approach, make sure to link directly to the raw file.
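The files to host can be produced with a few lines of code. A minimal sketch, where the JSON field names follow the standalone Embedding Projector's config format and the file names and URLs are illustrative placeholders to replace with your own raw hosted links:

```python
import csv
import json

# Toy embedding: 2 points in 3 dimensions, with one metadata label per point.
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
labels = ["cat", "dog"]

# Tensor values go in a tab-separated file, one point per row.
with open("tensor.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(vectors)

# Metadata is one label per line, in the same order as the tensor rows.
with open("metadata.tsv", "w") as f:
    f.write("\n".join(labels) + "\n")

# Config pointing the projector at the hosted files. The URLs below are
# placeholders; they must point at the *raw* files (e.g. a raw gist URL).
config = {
    "embeddings": [{
        "tensorName": "my_embedding",
        "tensorShape": [len(vectors), len(vectors[0])],
        "tensorPath": "https://example.com/tensor.tsv",
        "metadataPath": "https://example.com/metadata.tsv",
    }]
}
with open("projector_config.json", "w") as f:
    json.dump(config, f, indent=2)
```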
Section 1: Summary of input-pipeline analysis
The overall performance is input-bound because, on average over all steps, the device has spent % of the time (standard deviation = ) waiting for input data [see Section 2 below for details].
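The summary statistic above can be sketched in a few lines: divide each step's input-wait time by its total step time, then average over steps. The timings below are made-up illustrative numbers:

```python
import statistics

# Per-step device timings (made-up data for illustration).
step_time_ms = [95.0, 100.0, 105.0, 110.0]   # total device step time
input_wait_ms = [38.0, 45.0, 52.5, 44.0]     # time spent waiting for input

# Percentage of each step spent waiting for input data.
pct_waiting = [100.0 * w / t for w, t in zip(input_wait_ms, step_time_ms)]

avg_pct = statistics.mean(pct_waiting)   # the "% of time" in the summary
std_pct = statistics.stdev(pct_waiting)  # the "standard deviation"
# A high average suggests the program is input-bound.
```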
Recommendation for next step:
Section 2: Device-side analysis details
Section 2.1: Device step time
Device step-time statistics (in ms): Average: ms (σ = ms); Range: - ms
[Chart: device step time in milliseconds vs. training step number]
Section 2.2: Range of device time waiting for input data across cores at each step
% of device step time waiting for input data (average over the maximum waiting time across cores at each step): Average: % (σ = %); Range: - %
[Chart: % of device step time vs. training step number]
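The Section 2.2 statistic differs from Section 1 in one detail: at each step it takes the *maximum* waiting fraction across cores before averaging over steps, so one slow core dominates. A sketch with made-up data:

```python
import statistics

# waiting_pct[step][core] = % of the step time that a core spent waiting
# for input data (made-up numbers for illustration).
waiting_pct = [
    [10.0, 30.0, 20.0],   # step 0
    [25.0, 15.0, 35.0],   # step 1
    [40.0, 20.0, 30.0],   # step 2
]

# Worst (maximum) waiting percentage across cores at each step...
worst_per_step = [max(cores) for cores in waiting_pct]

# ...then the average and range of that per-step maximum.
avg_pct = statistics.mean(worst_per_step)
rng = (min(worst_per_step), max(worst_per_step))
```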
Section 3: Host-side analysis details
What can be done to reduce the above components of the host-side input time:
Click the "Show" button below to see the source data of the breakdown.
| Input Op | Count | Total Time (in ms) | Total Time (as % of total input-processing time) | Total Self Time (in ms) | Total Self Time (as % of total input-processing time) | Category |
|---|---|---|---|---|---|---|
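The distinction between the "Total Time" and "Total Self Time" columns is that an op's self time excludes time spent inside the input ops it invokes, so self times sum to the total input-processing time. A sketch with made-up ops and timings:

```python
# Made-up input ops: (name, total_ms, self_ms, category). For an op that
# invokes other input ops, total_ms > self_ms; for a leaf op they are equal.
ops = [
    ("ParallelMapDataset", 8.0, 3.0, "Preprocessing"),
    ("TFRecordDataset",    5.0, 5.0, "Data fetching"),
    ("BatchDataset",       2.0, 2.0, "Preprocessing"),
]

# Self times (not total times) sum to the total input-processing time.
total_input_ms = sum(self_ms for _, _, self_ms, _ in ops)

rows = [
    {
        "op": name,
        "total_ms": total_ms,
        "total_pct": 100.0 * total_ms / total_input_ms,
        "self_ms": self_ms,
        "self_pct": 100.0 * self_ms / total_input_ms,
        "category": category,
    }
    for name, total_ms, self_ms, category in ops
]
```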
Average step time (lower is better): ms (standard deviation = ms)
Host idle time (lower is better):
TPU idle time (lower is better):
Utilization of TPU Matrix Units (higher is better):
[Chart: step time in milliseconds vs. training step number]
| Time (%) | Cumulative time (%) | Category | Operation | GFlops/sec |
|---|---|---|---|---|
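The "Cumulative time (%)" column is a running sum of the per-op "Time (%)" column, with ops sorted by time in descending order. A one-liner sketch with made-up percentages:

```python
from itertools import accumulate

# Per-op time percentages, already sorted in descending order (made-up data).
time_pct = [40.0, 25.0, 20.0, 10.0, 5.0]

# Running sum gives the "Cumulative time (%)" column; the last entry is 100%.
cumulative_pct = list(accumulate(time_pct))
```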
Number of Hosts used:
TPU type: Cloud TPU
Number of TPU cores:
Batch size:
Modifying your model's architecture and data dimensions, and improving the efficiency of CPU operations, may help you reach the TPU's FLOPS potential.
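The Matrix Unit utilization metric above is the achieved throughput as a fraction of the hardware peak. A sketch where both GFLOPS figures are placeholders, not real hardware specifications:

```python
# Achieved vs. peak throughput (placeholder numbers, not real TPU specs).
achieved_gflops_per_sec = 11_250.0
peak_gflops_per_sec = 45_000.0   # assumed hardware peak, for illustration

# "Utilization of TPU Matrix Units (higher is better)".
utilization_pct = 100.0 * achieved_gflops_per_sec / peak_gflops_per_sec
```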
Iteration: 384
For fast results, the data will be sampled down to 10,000 points.
PCA is approximate.
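The sampling step above caps the dataset at 10,000 points so the view stays responsive; because PCA is then computed on a subset, the result is approximate. A minimal sketch of such uniform downsampling (the exact sampling strategy used by the tool is not specified here and may differ):

```python
import random

SAMPLE_CAP = 10_000  # the "sampled down to 10,000 points" cap

def downsample(points, cap=SAMPLE_CAP, seed=0):
    """Return at most `cap` points, chosen uniformly at random.

    A fixed seed keeps the sample reproducible across reloads.
    """
    if len(points) <= cap:
        return points
    rng = random.Random(seed)
    return rng.sample(points, cap)

# 25,000 toy 2-D points get reduced to the 10,000-point cap.
points = [[float(i), float(i % 7)] for i in range(25_000)]
sampled = downsample(points)
```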